Unsupervised natural language processing in the identification of patients with suspected COVID-19 infection

Patients with post-COVID-19 syndrome benefit from health promotion programs. Their rapid identification is important for the cost-effective use of these programs. Traditional identification techniques perform poorly especially in pandemics. A descriptive observational study was carried out using 105,008 prior authorizations paid by a private health care provider with the application of an unsupervised natural language processing method by topic modeling to identify patients suspected of being infected by COVID-19. A total of 6 models were generated: 3 using the BERTopic algorithm and 3 Word2Vec models. The BERTopic model automatically creates disease groups. In the Word2Vec model, manual analysis of the first 100 cases of each topic was necessary to define the topics related to COVID-19. The BERTopic model with more than 1,000 authorizations per topic without word treatment selected more severe patients - average cost per prior authorizations paid of BRL 10,206 and total expenditure of BRL 20.3 million (5.4%) in 1,987 prior authorizations (1.9%). It had 70% accuracy compared to human analysis and 20% of cases with potential interest, all subject to analysis for inclusion in a health promotion program. It had an important loss of cases when compared to the traditional research model with structured language and identified other groups of diseases - orthopedic, mental and cancer. The BERTopic model served as an exploratory method to be used in case labeling and subsequent application in supervised models. The automatic identification of other diseases raises ethical questions about the treatment of health information by machine learning.


Introduction
The COVID-19 1 pandemic reinforced the historical concern of researchers regarding the threat of new viruses and the mutation of existing ones. It put pressure on already overburdened health care services 2, through severe forms of the disease (approximately 25% of vulnerable patients or patients with comorbidities) and a high mortality rate (5.6% in the first wave 3). Additionally, structural changes in health care services, a greater impact on low- and middle-income countries 4, ethical conflicts in the prioritization of care 5 and financial challenges accentuated its impact. These challenges were aggravated by the emergence of long COVID-19 or post-COVID syndrome 6,7, which affects 10% to 30% of patients 8. New pandemics are expected to emerge in the future 9, and early identification of patients will be important for the correct and cost-effective adoption of care.
The treatment of information is a challenge, due to its increasing volume 10 or to the peculiarities of the different areas of knowledge. In health care, data are incomplete, heterogeneous, multidimensional, unstructured and inaccurate 11,12. To address these challenges, knowledge discovery in databases (KDD) was proposed, based on the mining (data mining) of large volumes of data (big data) 13,14.
Machine learning (ML) techniques enable the algorithm to learn patterns that are unidentifiable by classification or prediction techniques 15. This learning can be supervised, with labels that classify the object of study, or unsupervised, with no classification. In the unsupervised case, exploratory techniques are used for the creation of labels and the subsequent application of supervised techniques 15. The labeling of medical data is difficult and depends on specialized work, being a limiting factor in studies of the pandemic 16. Thus, unsupervised exploratory techniques are an important step in the application of ML on large volumes of data for knowledge discovery.
Text data mining refers to the discovery of patterns as proposed by Fayyad et al. 10, while natural language processing (NLP) is seen as a branch of artificial intelligence that deals with human language 17 or makes this language understandable to computers 18, thereby enabling different approaches, including the grouping of texts by topics ("topic modeling"). Topics are groups of similar objects, being a particular case of clustering.
Health care providers process data necessary for regulatory 19 and health care cohesion. Among them, prior authorization is the process of verifying the eligibility of patients and the coherence between the disease and treatment. It is requested before health care. This process is indirectly regulated by the Brazilian National Supplementary Health Agency (ANS) by guaranteeing service deadlines 20.
Prior authorization analysis provides an opportunity for early patient selection. However, due to medical confidentiality, there is no information on the International Classification of Diseases, 10th revision (ICD-10). Also, the requested care procedures do not allow the correct correlation with the disease to be treated, and the complementary information of the prior authorization is not structured. Therefore, there is an opportunity for innovative solutions in the identification of patients in health care providers in Brazil. This is an important economic sector that covers approximately 25% of the Brazilian population with expenditures equivalent to 5.7% of gross domestic product (GDP) 21.
There are few studies using NLP in health care in Brazil. Duval et al. 22 built a pharmacosurveillance system using Twitter to detect adverse events caused by drugs, using as a model the drug doxycycline for the treatment of malaria. Moreira et al. 23 proposed a hybrid model through which NLP created patient clusters using unstructured data. These clusters were incorporated into structured data, improving the accuracy of the diagnosis of patients with suspected dementia 23. Diniz et al. 24 created a mobile phone system to identify patients with suicidal ideation that allowed the individual quantification of moment-to-moment risk ("digital phenotyping"), enabling the action of health care professionals.
No studies using supplementary health care data were found, probably due to the difficulty of access to data in this health care sector, limited by barriers of professional and commercial secrecy. This study fills this gap and contributes to the application of ML methods in free software through a real case study.

Database and variables studied
Each authorization contains a free text field, "clinicalindication", in which the reason or justification for the prior authorization request is indicated. Filling in this field is not mandatory. The provider may instead only attach documents justifying the request for the procedure. In this case, it is common to fill in the field with the text "attached" or to leave it blank. The "clinicalindication" variable is the variable of interest in this study.
Prior authorizations issued between September 1, 2019 and June 30, 2022 were selected (n = 742,901). Those missing the justification (missing values) in the "clinicalindication" field (n = 558,530, 75%) were excluded. Therefore, 184,371 (25%) prior authorizations were included in this study, of which 105,008 contain payment information. Each prior authorization contains at least one health care event identified in the event structure and event description variables, corresponding respectively to the code of the requested event and its description. Authorizations are classified according to: type ("treatmenttype"), regime ("treatmentregime") and objective of care ("treatmentobjective"). Filling in the ICD-10 field is not mandatory. Authorizations have an expiration date ("expirationdate") and can be canceled, reissued or revalidated according to the provider's administrative criteria. Box 1 contains the variables present in the database and used in this study.

• BERTopic model
The BERTopic model is an unsupervised algorithm for vector-based topic modeling. Topic modeling is a mining method whose objective is to discover hidden patterns considering the context and classify the respective texts into similar groups 25,26, called topics.

Box 1
Variables from the prior authorization database of a private health care provider. São Paulo, Brazil.
SADT: diagnostic and therapeutic support services; TISS: information exchange in supplementary health.
* The transformation of this variable is described in the text of the article. It is the variable of interest for natural language processing analysis.
Initially, each document, in this case each prior authorization, is converted to its vector representation (word embedding) using the Bidirectional Encoder Representations from Transformers (BERT) model. The dimensionality of this representation is reduced using the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) technique, and the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm is applied to create topics of documents that are semantically similar 27. For the description of each topic, we used the term frequency-inverse document frequency (TF-IDF) 28,29,30,31 method. Documents not classified by the model are grouped into a specific topic containing outliers. In this work, the methods were applied through a free library based on Python 28 called BERTopic.
Two values were used to define the minimum number of authorizations in each topic created: 500 or more (BERTopic +500) and 1,000 or more (BERTopic +1,000), set in the min_topic_size parameter of the model. Since it is an automatic model, the total number of topics created depends on this parameter. The language parameter was defined as multilingual for modeling the text in Portuguese and the vectorization model (embedding_model) as all-MiniLM-L6-v2, which is the standard of the model.
Cad. Saúde Pública 2023; 39(11):e00243722
To identify the topics belonging to COVID-19, the get_topic_info() method of the model itself was used, which generates the automatic description of the topic.
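The configuration described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names bertopic_params and fit_topics and the docs argument are assumptions introduced here; only the parameter values (multilingual, all-MiniLM-L6-v2, 500 and 1,000) come from the text. The bertopic package is imported inside the function so that the parameter helper can be inspected without it.

```python
def bertopic_params(min_topic_size):
    """Collect the settings named in the text (helper name is hypothetical)."""
    return {
        "language": "multilingual",
        "min_topic_size": min_topic_size,
        "embedding_model": "all-MiniLM-L6-v2",
    }


def fit_topics(docs, min_topic_size):
    """Fit BERTopic on a list of clinical-indication strings.

    Requires the third-party `bertopic` package, imported lazily here.
    """
    from bertopic import BERTopic

    model = BERTopic(**bertopic_params(min_topic_size))
    topics, _ = model.fit_transform(docs)
    # get_topic_info() returns one row per topic with its automatic
    # description; topic -1 gathers the documents treated as outliers.
    return model, topics, model.get_topic_info()


# The two configurations compared in the study.
PARAMS_500 = bertopic_params(500)
PARAMS_1000 = bertopic_params(1000)
```

Calling fit_topics(docs, 1000) on the list of "clinicalindication" texts would then return the fitted model, the topic assigned to each document and the topic description table.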

• Word2Vec model
Word2Vec is an NLP model that uses neural networks to learn the representation of words (word embedding) in a high-dimensional vector space, capable of capturing the semantic and syntactic context of words in a given text corpus. For the comparative analysis, we used the continuous Bag-of-Words 32,33 model of the Word2Vec algorithm. The texts of the "clinicalindication" variable were separated into words (tokens) using the NLTK library (Natural Language Toolkit - https://www.nltk.org/), on which we applied the Word2Vec algorithm from the Gensim library (https://pypi.org/project/gensim/), using a vector size equal to 300. The word vectors were then recalculated as their average for each text and categorized into 20 clusters using the K-Means algorithm. These clusters were considered the topics of this model. This method does not automatically assign names to topics. To identify clusters with suspected cases of COVID-19 infection, each of the 20 clusters was manually analyzed by the main researcher. To this end, the first 100 authorizations, ranked in descending order of expenditure, were selected in each cluster. Each text present in the clinical indication variable was analyzed and the respective cluster was classified, or not, in the COVID-19 group.
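The pipeline described above (word vectors, averaging per text, K-Means) can be sketched as follows. It is an illustration under stated assumptions rather than the study's implementation: the function names, min_count=1, n_init=10 and the seed are choices made here, and the tokenized input is assumed to have been produced beforehand (the text uses NLTK for that step). Gensim and scikit-learn are imported lazily inside cluster_documents.

```python
def mean_vector(tokens, vectors, size):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # A text with no known token gets a zero vector (assumption).
        return [0.0] * size
    return [sum(col) / len(known) for col in zip(*known)]


def cluster_documents(tokenized_docs, size=300, n_clusters=20, seed=0):
    """Train CBOW Word2Vec, average per document, cluster with K-Means.

    Requires the third-party packages gensim and scikit-learn.
    """
    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans

    # sg=0 selects the continuous bag-of-words architecture.
    w2v = Word2Vec(sentences=tokenized_docs, vector_size=size, sg=0,
                   min_count=1, seed=seed)
    vectors = {w: w2v.wv[w].tolist() for w in w2v.wv.key_to_index}
    doc_vectors = [mean_vector(doc, vectors, size) for doc in tokenized_docs]
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(doc_vectors)


# Toy two-dimensional embeddings, only to show the averaging step.
toy_vectors = {"febre": [1.0, 3.0], "tosse": [3.0, 5.0]}
toy_mean = mean_vector(["febre", "tosse", "xyz"], toy_vectors, size=2)
```

In the toy call, the unknown token "xyz" is ignored and the two known vectors are averaged component by component.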
Each of the two models was applied to the descriptions, treated or not, contained in the "clinicalindication" variable of the prior authorization. The treatment of the variable is recommended to improve the performance of the Word2Vec model.
The treatment of the clinical indication variable occurred as follows: conversion of all words into lowercase, removal of stopwords in Portuguese, exclusion of the most common words in health and exclusion of special characters. No accents or other features of Portuguese were replaced. The words COVID-19 and SARS-CoV-2 were turned into covid. The ICD-10-related words present in the clinical indication variable were also standardized.
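The treatment steps listed above can be sketched as follows. The word lists are hypothetical and deliberately tiny, since the lists used in the study are not given; the function name treat and the regular expressions are also assumptions. The covid replacement is applied before punctuation is removed so that "SARS-CoV-2" is still intact when matched, and the ICD-10 standardization is not reproduced because its rules are not described.

```python
import re

# Hypothetical word lists; the ones used in the study are not published here.
STOPWORDS_PT = {"de", "com", "e", "a", "o", "para", "em", "por"}
COMMON_HEALTH_WORDS = {"paciente", "solicito"}
COVID_PATTERN = re.compile(r"covid[\s-]*19|sars[\s-]*cov[\s-]*2")


def treat(text):
    """Lowercase, unify COVID terms, drop special characters and stopwords."""
    text = text.lower()
    text = COVID_PATTERN.sub("covid", text)
    # Replace special characters with spaces; \w keeps accented letters,
    # so Portuguese accents are preserved as described in the text.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [t for t in text.split()
              if t not in STOPWORDS_PT and t not in COMMON_HEALTH_WORDS]
    return " ".join(tokens)


example = treat("Paciente com SARS-CoV-2, dispnéia e febre!")
```

With these toy lists, the example sentence is reduced to the three tokens covid, dispnéia and febre.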

Evaluation of the quality of the classification generated by the models
Thus, we reached 6 different types of models: BERTopic +500, BERTopic +1,000 and Word2Vec, each with and without text treatment of the clinical indication variable (treated and untreated).
To assess the quality of the classification, the main author analyzed the BERTopic +1,000 model because it presented the highest average cost per authorization. Thus, the first 100 authorizations classified as suspected or COVID-19-related events by this model were ordered in descending order of cost. The clinical indication text of each of these authorizations was manually analyzed and classified into classes of interest for the study. This manual classification was compared to the automatic classification generated in this model.
For comparison with traditional structured query language (SQL) research methods, all prior authorizations containing the words covid, sars, coronavirus and coronavírus in uppercase or lowercase letters were selected and compared with the models generated using the authorization number as a binding index and identifying whether they were part of the groups identified as suspected COVID-19 infection.
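The comparison described above can be sketched as follows. The study used SQL for the keyword search; the regular expression below is a stand-in chosen for this illustration, and the record numbers, texts and model output are hypothetical. Authorization numbers serve as the key that links both selections.

```python
import re

# Case-insensitive search for the four words named in the text.
KEYWORDS = re.compile(r"covid|sars|coronavirus|coronavírus", re.IGNORECASE)


def keyword_hits(authorizations):
    """Return the authorization numbers whose text contains a keyword."""
    return {num for num, text in authorizations.items()
            if KEYWORDS.search(text)}


def compare(keyword_set, model_set):
    """Split authorization numbers by which method selected them."""
    return {
        "both": keyword_set & model_set,
        "lost_by_model": keyword_set - model_set,
        "model_only": model_set - keyword_set,
    }


# Hypothetical records: authorization number -> clinical indication text.
records = {
    101: "Suspeita de COVID-19",
    102: "CORONAVÍRUS confirmado",
    103: "desconforto respiratório",
    104: "fratura de fêmur",
}
model_covid = {101, 103}  # hypothetical set selected by a topic model
result = compare(keyword_hits(records), model_covid)
```

The "lost_by_model" group corresponds to cases found by the traditional search but not classified by the model, and "model_only" to cases the model selected without any of the search words.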

Prior authorization cost
Prior authorization cost corresponds to the health care expenditures of each prior authorization. The payment basis contains the expenses paid to service providers net of disallowance. Costs were obtained using the prior authorization number as the connecting key.
The total amount paid corresponds to the sum of all expenses in the period from September 2019 to July 2022 found in the payment basis for each prior authorization. The number of authorizations paid corresponds to the count of authorizations with an amount spent per authorization greater than BRL 0.00.
The average cost per paid authorization corresponds to the ratio between authorization expenditure and the number of paid authorizations. In this study, the most severe cases were those with the highest average cost per prior authorization. Expenditures are presented in reais and without inflation adjustment.
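The three indicators defined above can be computed as in the sketch below. The function name cost_summary, the row layout and the toy amounts are assumptions for illustration; the definitions (sum per authorization, count of authorizations with more than BRL 0.00 paid, and their ratio) follow the text.

```python
def cost_summary(payments):
    """payments: iterable of (authorization_number, amount_paid) rows."""
    totals = {}
    for number, amount in payments:
        # The authorization number is the key linking payments.
        totals[number] = totals.get(number, 0.0) + amount
    # Paid authorizations are those with total spending above BRL 0.00.
    paid = {n: v for n, v in totals.items() if v > 0}
    total = sum(paid.values())
    average = total / len(paid) if paid else 0.0
    return {"total": total, "paid_count": len(paid), "average": average}


# Toy payment rows: authorization 1 has two payments, authorization 2 none.
rows = [(1, 100.0), (1, 50.0), (2, 0.0), (3, 350.0)]
summary = cost_summary(rows)
```

In the toy data, two authorizations are counted as paid and the average cost per paid authorization is the total divided by that count.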
Access to data was granted through a confidentiality and scientific cooperation agreement with the provider and approved by the Research Ethics Committee of Ribeirão Preto School of Medicine, São Paulo University (HCFMUSP/RP; protocol n. 55685722.9.0000.5440).

Results
A total of 742,901 authorizations were issued in the 34 months analyzed, of which 184,371 (24.9%) were filled in with at least one number or word and were therefore included and analyzed in this study. Of these, 105,008 were paid authorizations (14.1%). The total expense in the period was BRL 374,089,836. This expenditure is right skewed (R(105,008) = 0.438, p = 0.000; skewness 41.3) (Figure 1).
The most frequent health care events in the analyzed authorizations were: emergency room consultation (6.1% of the analyzed authorizations contain this event), individual psychotherapy session (5.7%) and RT-PCR screening for COVID-19 (5%).A total of 96.2% of the prior authorizations have no description of ICD-10 and only 587 (0.3%) have ICD-10 B34.2 -"Coronavirus infection, unspecified".
As for treatment type, 90.7% were clinical treatments, 7.8% surgical and 0.3% obstetric. Regarding the health care regime, 81% were outpatient care, 16.9% hospital care, and 1% home care. Inpatient clinical care corresponded to 15,741 authorizations, 8.5% of the total (Table 1).
Regarding the objective of care, 75.1% were for diagnosis and 6.5% for reparative treatment; 18.3% of the prior authorizations had no objective of care filled in. In the outpatient regimen, the diagnostic objective was more frequent (80.6%). In the hospitalization regimen, there is an important group of reparative care (34.5%) (Table 2).
In the topics classified as COVID-19, the untreated BERTopic models presented higher average costs per paid authorization: BRL 10,205 in the one with more than 1,000 authorizations and BRL 10,138 in the one with more than 500 authorizations per topic. They correspond respectively to 1.9% (1,987) and 2.3% (2,443) of the authorizations paid and to expenses of BRL 20.3 million (5.4% of total expenditure) and BRL 24.8 million (6.6%). The two models showed a significant number of paid authorizations considered discrepant: 58.8% (61,723) in the BERTopic +1,000 model and 48.3% (50,716) in the BERTopic +500 model (Table 3). With the treatment of the "clinicalindication" variable, there was an increase in the number of authorizations of suspected cases of COVID infection in the BERTopic model with more than 500 authorizations (to 3.3% of the total authorizations paid) and a decrease in the model with more than 1,000 authorizations (1.7%), followed by a significant reduction in the total expenditure (BRL 5.2 million and BRL 14 million, respectively) when compared to the same models without word treatment, resulting in a decrease in the average costs per authorization in the two models. There was a decrease in the number of prior authorizations considered discrepant, although it remained high (36.3% in the BERTopic +1,000 model and 45.2% in the BERTopic +500 model) (Table 3).
BERTopic +500 = minimum of 500 authorizations per topic; BERTopic +1,000 = minimum of 1,000 authorizations per topic.
Note: the Word2Vec model classifies all authorizations and therefore.
* Outliers correspond to authorizations not classified in topics by the model.
The treatment of the "clinicalindication" variable substantially modified the indicators of the Word2Vec model. For cases classified as COVID-19, without treatment, this model presented lower numbers for paid authorizations (n = 1,005, 0.5%), total expenditure (BRL 4,909,189, 1.3%) and average cost per authorization (BRL 4,885) than those for the model with word treatment: 5,989 (5.7%), BRL 30.1 million (8%), and average cost of BRL 5,021, respectively (Table 3).
The comparison between the 6 models showed that the BERTopic +1,000 model without treatment has a lower number of authorizations classified as suspected covid with high total expenditure, and that the Word2Vec model with treatment has a higher number of authorizations classified as suspected covid with higher total expenditure (BRL 30 million), but resulting in a lower average cost (Table 3).
The evaluation of the classification quality of the BERTopic +1,000 model shows that, of the first 100 cases analyzed manually, 70 are related to suspicion of or infection by COVID clearly indicated in the text of the clinical indication variable. These patients had expenditure of BRL 11.5 million, 56.5% of the total expenditure identified in this model (Box 2).
Another 20 patients have signs, symptoms or respiratory diseases that may or may not be related to COVID. The expenditure in this group was BRL 2.5 million. Another 8 cases are newborns with respiratory distress, all with no connection to the disease except one extremely premature newborn born to a mother with COVID. The other 2 cases present respiratory signs and symptoms unrelated to the disease (Box 2). Box 3 shows the first 15 authorizations of this quality assessment with the original description of the prior authorization, the respective manual classification and expenditure per authorization. The analysis of the first 100 cases is shown in Box 2.
Traditional method: structured query language (SQL) considering the presence of uppercase and lowercase words covid, sars, coronavirus and coronavírus.
Note: the traditional method found 3,703 authorizations with a total expenditure of BRL 23,611,018.
The traditional method using SQL and selection of prior authorizations containing the words covid, sars, coronavirus and coronavírus resulted in 3,703 authorizations paid with a total expenditure of BRL 23,611,018 -average cost of BRL 6,376.
Comparing the traditional method with the generated NLP models shows that some of the prior authorizations selected by the traditional method were not classified by the models, that is, cases of interest were lost. These authorizations are spread across the different topics of the models but concentrated in the topic with outliers, where it is not possible to make the classification.
In the BERTopic models, the greatest loss of cases occurred in the untreated model with more than 1,000 authorizations: 2,377 (64.2%) authorizations were not classified by the model, with a total expenditure of BRL 8.7 million and an average cost of BRL 3,673. The BERTopic model with more than 500 authorizations without treatment was a little better: 1,622 (43.8%) unclassified authorizations, expenditure of BRL 5.1 million and average cost per authorization of BRL 3,214. These lost cases have an average cost per authorization almost 3 times lower than those classified by the models. The treatment of the words caused these models to stop classifying the less severe cases: the average costs per authorization of the lost cases were BRL 9,323 and BRL 7,217 in the BERTopic +1,000 and BERTopic +500 models respectively.
On the other hand, the models classified authorizations not selected in the traditional method. The 362 authorizations in excess in the BERTopic +500 untreated model that do not contain the words of the traditional search have an average cost of BRL 17,196, an expense of BRL 6.2 million. In the BERTopic +1,000 untreated model, prior authorizations with the same characteristic (661 authorizations) have an average cost of BRL 8,165 and a total expense of BRL 5.4 million. The Word2Vec model with the best performance in this regard (2,703 authorizations with expense of BRL 11,369,283 and average cost per authorization of BRL 4,206) is the treated model (Table 4).
The BERTopic models generated other topics of interest, related to cancer (1,500 prior authorizations and BRL 6,662,411 spent), orthopedic diseases (4,531 prior authorizations and BRL 13,675,723 spent) and mental illnesses (3,603 prior authorizations and BRL 818,893 spent). These topics vary depending on the method employed: the BERTopic +1,000 models, treated or untreated, were worse, generating few additional topics.

Discussion
The BERTopic model without word treatment selected more severe patients while the Word2Vec model with word treatment selected less severe patients. As early as 1998, Hernández & Stolfo 34 discussed the difficulty of working with real-world data. This challenge is greater with the use of unstructured data. The 100 cases manually analyzed show differences in how to name the virus, amplified by the peculiarities of the Portuguese language, such as accents. Another challenge is the breadth of information: most authorizations were filled out with sentences of up to 5 words.
Still, the BERTopic model was able to select cases with the description "flu-like symptoms for 10 days. Respiratory distress. With tachydyspnea" as suspected virus infection. It is observed that there is no explicit mention of COVID-19 and that, in the original Portuguese, respiratory has an accent while tachydyspnea does not, an example of the problem of unstructured data. This difficulty should explain why there are few studies using NLP applied to early detection of the disease. In a review of the use of artificial intelligence tools applied in the response to the pandemic, Syrowatka et al. 35 showed that the use of NLP in the pandemic involved topic modeling applied in the search for literature related to COVID-19 and in the study of non-adherence to social distancing 36.
In a study comparing different topic modeling methods in social media, Egger et al. 37 showed that the BERTopic model better separated the topics and that its analysis tools enable a better understanding of the interrelations between the topics. Such tools are visual and the authors state that the topics require human interpretation 37.
As for human participation, a holistic and multidisciplinary view is needed, based on the human interpretation of the topics (modeling dimension) and the well-being of the patient (health dimension) considering financial aspects (economic dimension).
As an example of the challenge of this holistic view, it is observed that the models studied have opposite behaviors: one selects severe cases and the other selects less severe cases. The implementation of a health promotion program in the context of post-COVID-19 syndrome involves much more than the simple interpretation of the topics generated by an automatic model. It is a multidisciplinary enterprise also comprising the design of the program, identification and correct allocation of patients, their monitoring, evaluation of outcomes and financial results.
Post-COVID-19 syndrome patients require a wide range of special care, from reestablishment of previous health conditions to rehabilitation 38. In this context, it is important to note that automatically generated models and the interpretation of their topics, although interesting, are part of a process that is highly dependent on people. Although, in the health care field, human resources are specialized and expensive, human participation is essential, not only in interpreting the topics generated but also in designing the entire program in line with this interpretation. It is worth using an NLP model in the early identification of diseases as long as a multidisciplinary team conducts the task of providing patients with quality, accessible and sustainable health care.
Specifically considering the informational dimension, an unsupervised model, especially when there is no word treatment, has some advantages. It is not influenced by the researcher. Another advantage is that it supports supervised models by being employed as an exploratory technique 39. The necessary human interpretation fits consistently in a flow of patient discovery with the following steps: (1) unsupervised exploratory analysis, the object of this study; (2) human interpretation and labeling based on the program design; (3) classification of cases; (4) application of labels in a supervised model with discovery of new patients. A supervised model has better performance and direct measures of quality assessment for classification, but the lack of labels on unstructured information makes its applicability very difficult.
In this study, we used two indirect quality assessment methods. In the first, there is human analysis and classification of authorization requests of the BERTopic +1,000 model, selected because of their possible greater severity and simulating the step of classification of cases by a specialist. This practical exercise shows the dependence on human interpretation. While most cases (90%) would be of interest for careful evaluation, through contact with the patient for example, others were clearly misclassified (e.g., "respiratory distress"). However, they are still interesting: one of the cases is a premature newborn from a mother infected by COVID-19, whose analysis may lead to a specific program for pregnant women in this pandemic period.
The second indirect quality assessment method used structured query language (SQL), indicating that BERTopic models lose a significant group of suspected patients. These cases were less severe. The loss was not resolved with a change in the number of documents per topic (there was an increase in outliers) nor with the treatment of words (the groups became less identifiable). These non-classified cases reinforce the need for a semantic context to apply the method, which is associated with the quality of the information in the authorization request. Only 25% of prior authorizations have some information and, of these, most have few words, making contextual analysis by the method difficult. There is an old discussion about data quality and its solution in the process of knowledge discovery in databases (KDD) 10. The use of real databases, such as the one used here, has great potential, and can even be used in evidence based on real data provided that the limitations imposed by quality are corrected 40,41.
The Word2Vec model performed better with word treatment when compared to traditional methods, in part because the treatment involved standardizing the COVID-19 words written in different ways. Although advantageous, this exposes the difficulty of maintaining such a model, and it is necessary to consider whether traditional search using SQL would not be better than this model.
However, it should be considered that traditional methods for extracting data from texts are subject to human errors; the a priori choice of words present in the text requires specialized knowledge 42 and may not fully take advantage of real-world information. Traditional database analysis options for identifying patients with certain diseases in providers are limited: ICD-10 codes are not informed and paid procedures do not allow the identification of the treated disease (e.g., lung computed tomography is paid in the same way for cancers, infections and checkup). There remains access to a wide range of unstructured information in which new methods, even if they need adjustments, can be more effective.
It is observed that, in this real setting with low quality of information and a high volume of prior authorizations with missing values or filled in with only one word, the study demonstrated the viability of an unsupervised model for the analysis of prior authorizations from health care providers without any previous treatment, with the use of software that is free, easy to use and easy to implement. This type of model is especially useful in the Portuguese language, in which coronavirus and coronavírus are different words for the computer but with identical meanings. It also addresses phrases such as "HR: 65BPM RR: 26BPM BP:100/57MMGH SAT: 95% on RA. maintained respiratory distress" because it "understands" that respiratory distress may be related to COVID-19.
Unexpectedly, the model generated other groups of interest, notably a group of cancer patients in which the topic formed practically describes the diagnosis attributed to patients ("neoplasm, malignant, breast"), and groups of patients with orthopedic problems and mental disorders. These are patients who can certainly benefit from health promotion programs.
On the other hand, an unsupervised model selected prior authorizations belonging to cancer patients. This raises serious concerns about the ethical and responsible handling of information. This work highlights the problems that these models can cause in the ethical field 43, especially by focusing on the technical application of NLP while disregarding the human dimension. There is a need for broad human participation in different stages of the creation of a health promotion program for patients with post-COVID-19 syndrome. This does not make the method less important; it only reinforces the need for human control.
To the best of our knowledge, this is the first study employing this technique using supplementary health care data in Brazil.

Study limitations
The model cannot be broadly generalized due to factors such as: (i) being a proprietary base; (ii) difficulty in accessing information due to ethical and legal secrecy; and (iii) the use of a model trained on a non-medical corpus in English. We also observed a substantial number of authorizations with semantically poor descriptions, impairing the classification. The quality assessment of the model depended on manual analysis by the main researcher, which may introduce a bias that is mitigated by the exposure of the information and its classification.

Additional studies
The model should be enhanced by a supervised method with the inclusion of labels created by specialists. It can also be enriched with other machine learning methods, such as the analysis of the images attached to the authorizations. It is necessary to discuss the ethical aspects of applying automated models, especially when they classify people into disease groups. It is necessary to assess the impact of treatment regimens and objectives (e.g., outpatient and diagnostic) on the behavior of the models. It is necessary to conduct further studies on the interrelation of different dimensions of knowledge and respective professionals in the provision of integrative, collaborative and sustainable care.

Conclusion
The BERTopic model without word treatment selected more severe patients with suspected COVID-19 infection than the Word2Vec model with word treatment. On the other hand, with word treatment, the latter model was able to select a larger group of suspected cases. The choice of the best model depends on complementary human analysis and on the health promotion program designed.
Compared to traditional methods, the BERTopic models failed to classify some suspected cases, mostly of lower severity, which may nonetheless be relevant in an integrated health care model. This reinforces the exploratory character of the method, its intermediate use for the application of a supervised model and the need to compare results with traditional research methods.
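The comparison with a traditional search can be expressed as set operations over authorization identifiers. The snippet below is a minimal sketch, not the authors' implementation: the keyword list, the sample records, the `id` key and the function names are hypothetical, and only the field name `clinicalindication` is taken from the study's data description.

```python
import re

# Hypothetical keyword list for the traditional (structured) search.
COVID_TERMS = ("covid", "sars-cov-2", "coronavirus")
PATTERN = re.compile("|".join(re.escape(t) for t in COVID_TERMS), re.IGNORECASE)


def keyword_hits(records):
    """Return ids of authorizations whose free text mentions a COVID-19 term."""
    return {
        r["id"]
        for r in records
        if PATTERN.search(r.get("clinicalindication") or "")
    }


def compare(model_ids, traditional_ids):
    """Split authorizations into agreement and the two kinds of disagreement."""
    model_ids, traditional_ids = set(model_ids), set(traditional_ids)
    return {
        "both": model_ids & traditional_ids,
        "lost_by_model": traditional_ids - model_ids,
        "found_only_by_model": model_ids - traditional_ids,
    }


# Made-up records for illustration only.
records = [
    {"id": 1, "clinicalindication": "Pneumonia, suspected COVID-19"},
    {"id": 2, "clinicalindication": "Dyspnea and fever, ground-glass opacities"},
    {"id": 3, "clinicalindication": "Knee arthroscopy"},
    {"id": 4, "clinicalindication": "SARS-CoV-2 positive, mild symptoms"},
]
model_ids = {1, 2}  # hypothetical output of a COVID-19 topic

result = compare(model_ids, keyword_hits(records))
```

Here `lost_by_model` holds the cases a keyword search finds but the topic model misses, and `found_only_by_model` holds cases described without any explicit COVID-19 term, which are the ones that require human review.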
On the other hand, the model also generated topics of interest for future studies, with special attention to suspected cases of cancer patients.
The findings demonstrate the importance of human participation: analysis of the generated topics for correct classification, generating information for a supervised model; choice of the best model according to the perspective of health care management (more severe versus less severe patients); design of a health promotion program aligned with this choice; and attention to the ethical aspects of the use of machine learning tools in health care.

Contributors
R. P. Silva contributed with the study conception and design, methodology, data acquisition and analysis, writing and review; and approved the final version. J. T. Pollettini contributed with the methodology, data analysis and critical review; and approved the final version. A. Pazin Filho contributed with the study design, methodology, data analysis, writing and critical review; and approved the final version.

(ICD-10) related to the authorization informed by the requesting service provider, it is not a mandatory field | Text | No
icd_description | Description of the ICD-10 related to the authorization | Text | No
careregimen | Type of facility used for care according to provider classification: outpatient clinic, home care, day hospital care, hospitalization, and emergency room | Text | No
treatmenttype | Type of treatment: surgical, clinical, obstetric, pediatric, psychiatric, dental | Text | No
treatmentobjective | Treatment objective: diagnostic, palliative, preventive, restorative, therapeutic | Text | No
requestdate | Date of request for prior authorization | Date | No
eventstructure | Prior authorization event code. The service provider indicates the health care event they wish to perform, which is analyzed by the health care provider and authorized | Text | No
eventdescription | Description of the authorized event | Text | No
clinicalindication | Field informed by the service provider which contains the justification for requesting the procedure of the prior authorization. This field is not mandatory and is supported by other information submitted as attachments. It is a free text field without any kind of automatic validation | Text | Yes *

4. The topics formed by each model are shown in Boxes 4, 5, 6 and 7.
Cad. Saúde Pública 2023; 39(11):e00243722

Box
Number of prior authorizations by topics generated by the BERTopic +500 model without word treatment and respective description of the authors.

Box 5
Number of prior authorizations by topics generated by the BERTopic +500 model with word treatment and respective description of the authors. ICD-10: International Classification of Diseases, 10th revision. Note: topic -1 is considered "outlier" according to the model. * Includes all authorizations including zeroed values; ** Topics automatically generated by the model; *** Qualitative analysis of the name generated by the topic by the authors.

Box 7
Number of prior authorizations by topics generated by the BERTopic +1,000 model with word treatment and respective description of the authors. ICD-10: International Classification of Diseases, 10th revision. Note: topic -1 is considered "outlier" according to the model. * Includes all authorizations including zeroed values; ** Topics automatically generated by the model; *** Qualitative analysis of the name generated by the topic by the authors.

Table 1
Number of prior authorizations analyzed by type of treatment according to supplementary health care provider authorization care regimen.São Paulo, Brazil, September/2019 to June/2022.

Table 2
Number of prior authorizations analyzed by treatment objective according to care regimen of the supplementary health care provider authorizations.

Table 3
Models and characteristics of prior authorizations paid according to suspected COVID-19 infection and outliers of authorizations issued by a supplementary health care provider. São Paulo, Brazil, September/2019 to June/2020.

Models Topics (n) Outliers * Suspected COVID-19 infection authorization topics
Evaluation of the BERTopic +1,000 model without treatment by manual classification of the 15 authorizations ordered by cost of suspected cases of COVID-19 infection in a supplementary health care provider. São Paulo, Brazil, September/2019 to June/2022.
* Author classification based on analysis of the "clinicalindication" field. The classification was independent of the classification generated by the model. Cases classified as COVID-19 indicate suspected infection of a patient whose prior authorization was issued under the terms contained in the clinical indication.

Table 4
Comparison of the models with the traditional word selection method in the classification of authorizations issued by a supplementary health care provider. São

Models Prior authorization classified in the model Lost when compared to traditional method Model found but traditional method lost
Number of prior authorizations by topics generated by the BERTopic +1,000 model without word treatment and respective description of the authors.
indicated only 1 NLP-based study for early diagnosis or patient screening. Most studies (65 of 78) used chest image processing techniques. The authors indicate that most studies analyzed are still in the research phase and few are used for decision-making 35. A specific review on
health through machine learning.
COVID-19; Natural Language Processing; Health Care; Patient Selection Criteria; Private Health Institutions
Submitted on 19/Jan/2023
Final version resubmitted on 26/Jun/2023
Approved on 04/Jul/2023